431 Class 25

Thomas E. Love, Ph.D.

2023-12-05

Getting Started

  • Some modeling with Favorite Movies from 2023-10-24.
    • Note: I downloaded Google Sheet to an Excel .xlsx file.

Definitely not before you finish Quiz 2.

  • I wrote the quiz and sketch in 4.3.1.
  • After that, it’s up to you. If you use version 4.3.1 or later for your Project B, I’m fine. I’ll start caring about 4.3.2 in 432.
  • You will probably want to update your packages either way.
  • I have completed the upgrade to 4.3.2. for these slides, as you can see in the Session Information.

Today’s Packages

library(here)
library(readxl)
library(janitor)
library(gt)
library(gtExtras)
library(mosaic)
library(patchwork)
library(naniar)
library(simputation)
library(mice)
library(mitml)
library(corrplot); library(ggmice) # new packages today
library(broom)

library(xfun)
library(tidyverse)

theme_set(theme_bw())

Ingesting the data

movie_raw <- read_excel(here("c25/data/movies_2023-10-24.xlsx"),
                        na = c("", "NA")) |> # otherwise only "" is recognized
  clean_names() |>
  type.convert(as.is = FALSE) |>   # convert all characters to factors
  mutate(film_id = as.character(film_id), 
         film = as.character(film))

movies <- movie_raw |>
  select(film_id, imdb_pct10, fc_pctwins, rt_audiencescore, 
         ebert, box_off_mult, budget, metascore, bw_rating, imdb_oscars, 
         mentions, dr_love, gen_1, bacon_1, lang_1, 
         drama, comedy, adventure, action, romance, fantasy, 
         sci_fi, crime, thriller, animation, family, mystery, 
         biography, music, horror, musical, war, history,
         sport, western, film)

dim(movies)
[1] 201  36

Quick Check of Ingest

summary(movies)
   film_id            imdb_pct10      fc_pctwins rt_audiencescore
 Length:201         Min.   : 3.80   Min.   :24   Min.   :28.00   
 Class :character   1st Qu.:11.60   1st Qu.:42   1st Qu.:76.00   
 Mode  :character   Median :15.60   Median :52   Median :86.00   
                    Mean   :17.53   Mean   :51   Mean   :81.91   
                    3rd Qu.:22.20   3rd Qu.:60   3rd Qu.:92.00   
                    Max.   :55.00   Max.   :79   Max.   :98.00   
                                                                 
     ebert        box_off_mult         budget            metascore     
 Min.   :1.000   Min.   : 0.0013   Min.   :   200000   Min.   :  9.00  
 1st Qu.:2.875   1st Qu.: 2.6000   1st Qu.: 12000000   1st Qu.: 61.00  
 Median :3.500   Median : 4.7000   Median : 30000000   Median : 72.00  
 Mean   :3.190   Mean   : 8.5418   Mean   : 59242257   Mean   : 71.35  
 3rd Qu.:4.000   3rd Qu.: 9.3000   3rd Qu.: 90000000   3rd Qu.: 84.00  
 Max.   :4.000   Max.   :73.7000   Max.   :356000000   Max.   :100.00  
 NA's   :25      NA's   :20        NA's   :19          NA's   :10      
   bw_rating      imdb_oscars         mentions     dr_love   gen_1  
 Min.   :0.000   Min.   : 0.0000   Min.   :1.000   No :124   F: 45  
 1st Qu.:1.000   1st Qu.: 0.0000   1st Qu.:1.000   Yes: 77   M:156  
 Median :3.000   Median : 0.0000   Median :1.000                    
 Mean   :2.135   Mean   : 0.9849   Mean   :1.249                    
 3rd Qu.:3.000   3rd Qu.: 1.0000   3rd Qu.:1.000                    
 Max.   :3.000   Max.   :11.0000   Max.   :6.000                    
 NA's   :8       NA's   :2                                          
    bacon_1           lang_1        drama            comedy      
 Min.   :1.000   English :177   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.000   Japanese:  7   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :2.000   Hindi   :  5   Median :1.0000   Median :0.0000  
 Mean   :1.886   Italian :  2   Mean   :0.5721   Mean   :0.3582  
 3rd Qu.:2.000   Arabic  :  1   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :3.000   ASL     :  1   Max.   :1.0000   Max.   :1.0000  
                 (Other) :  8                                    
   adventure          action          romance          fantasy      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.3333   Mean   :0.2537   Mean   :0.1692   Mean   :0.1393  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
                                                                    
     sci_fi           crime           thriller        animation      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :0.1244   Mean   :0.1045   Mean   :0.1045   Mean   :0.08955  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
                                                                     
     family           mystery         biography           music        
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.00000   Median :0.0000   Median :0.00000   Median :0.00000  
 Mean   :0.08955   Mean   :0.0597   Mean   :0.05473   Mean   :0.05473  
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
                                                                       
     horror          musical             war            history       
 Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
 Mean   :0.0398   Mean   :0.02985   Mean   :0.0199   Mean   :0.01493  
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
                                                                      
     sport            western             film          
 Min.   :0.00000   Min.   :0.000000   Length:201        
 1st Qu.:0.00000   1st Qu.:0.000000   Class :character  
 Median :0.00000   Median :0.000000   Mode  :character  
 Mean   :0.01493   Mean   :0.004975                     
 3rd Qu.:0.00000   3rd Qu.:0.000000                     
 Max.   :1.00000   Max.   :1.000000                     
                                                        

Data Cleaning

  1. Let’s convert budget to express it in millions of US dollars
  2. lang_eng should show English (n = 177) vs. Non-English
movies <- movies |>
  mutate(budget = budget / 1000000,
         lang_eng = fct_lump_n(lang_1, n = 1, other_level = "Non-English"))

favstats(~ budget, data = movies) |> gt() |> 
  fmt_number(columns = mean:sd, decimals = 2) |> gt_theme_dark()
min Q1 median Q3 max mean sd n missing
0.2 12 30 90 356 59.24 68.02 182 19
movies |> tabyl(lang_eng, lang_1) |> gt() |> gt_theme_dark()
lang_eng Arabic ASL Bengali Danish English French German Hindi Italian Japanese Mandarin Norwegian Persian Spanish
English 0 0 0 0 177 0 0 0 0 0 0 0 0 0
Non-English 1 1 1 1 0 1 1 5 2 7 1 1 1 1

Which outcome shall we choose?

We’re interested in a percentage measure (0-100) addressing how beloved the movie is, according to an audience.

Variable NA Description
imdb_pct10 0 % of 10-star public ratings in IMDB as of 2023-09
fc_pctwins 0 % of matchups won on Flickchart as of 2023-10
rt_audiencescore 0 Rotten Tomatoes Audience Score (% Fresh) as of 2023-10

Which outcome shall we choose?

p1 <- ggplot(data = movies, aes(x = imdb_pct10)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, 
                 fill = "navy", col = "white") +
  scale_x_continuous(breaks = seq(0, 100, by = 10), limits = c(0, 100)) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$imdb_pct10),
                            sd = sd(movies$imdb_pct10)),
                col = "magenta", linewidth = 1.5)

p2 <- ggplot(data = movies, aes(x = fc_pctwins)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, 
                 fill = "navy", col = "white") +
  scale_x_continuous(breaks = seq(0, 100, by = 10), limits = c(0, 100)) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$fc_pctwins),
                            sd = sd(movies$fc_pctwins)),
                col = "magenta", linewidth = 1.5)

p3 <- ggplot(data = movies, aes(x = rt_audiencescore)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, 
                 fill = "navy", col = "white") +
  scale_x_continuous(breaks = seq(0, 100, by = 10), limits = c(0, 100)) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$rt_audiencescore),
                            sd = sd(movies$rt_audiencescore)),
                col = "magenta", linewidth = 1.5)

p1 / p2 / p3

Which outcome shall we choose?

Which outcome shall we choose?

df_stats(~ imdb_pct10 + fc_pctwins + rt_audiencescore, data = movies) |>
  gt() |> fmt_number(columns = min:sd, decimals = 1) |> gt_theme_dark()
response min Q1 median Q3 max mean sd n missing
imdb_pct10 3.8 11.6 15.6 22.2 55.0 17.5 9.1 201 0
fc_pctwins 24.0 42.0 52.0 60.0 79.0 51.0 12.6 201 0
rt_audiencescore 28.0 76.0 86.0 92.0 98.0 81.9 13.5 201 0

Available Predictors for fc_pctwins

str(movies |> select(-fc_pctwins))
tibble [201 × 36] (S3: tbl_df/tbl/data.frame)
 $ film_id         : chr [1:201] "M001" "M002" "M003" "M004" ...
 $ imdb_pct10      : num [1:201] 37.3 27.2 14.9 30.1 22.1 17.1 22.7 22.5 21.2 32.4 ...
 $ rt_audiencescore: int [1:201] 93 92 69 89 84 81 94 95 82 92 ...
 $ ebert           : num [1:201] NA 4 2.5 4 4 2.5 4 4 4 2.5 ...
 $ box_off_mult    : num [1:201] 9.1 NA 1.8 5.5 NA 7.3 9.7 2.9 12.3 7.9 ...
 $ budget          : num [1:201] 6.63 NA 30 12 NA ...
 $ metascore       : int [1:201] 67 93 70 84 87 55 89 87 83 78 ...
 $ bw_rating       : int [1:201] 1 3 3 0 3 3 3 3 3 3 ...
 $ imdb_oscars     : int [1:201] 0 2 0 1 0 0 1 8 3 0 ...
 $ mentions        : int [1:201] 1 1 1 1 1 2 1 1 1 1 ...
 $ dr_love         : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 2 1 2 ...
 $ gen_1           : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 2 2 2 ...
 $ bacon_1         : int [1:201] 3 2 2 2 2 2 2 2 2 2 ...
 $ lang_1          : Factor w/ 14 levels "Arabic","ASL",..: 8 9 5 5 13 5 5 5 5 5 ...
 $ drama           : int [1:201] 1 1 1 0 1 1 0 1 0 0 ...
 $ comedy          : int [1:201] 1 0 1 0 0 1 0 0 0 0 ...
 $ adventure       : int [1:201] 0 0 0 1 0 0 0 0 1 1 ...
 $ action          : int [1:201] 0 0 0 0 0 0 0 0 1 1 ...
 $ romance         : int [1:201] 0 0 1 0 0 0 0 0 0 0 ...
 $ fantasy         : int [1:201] 0 0 0 0 0 1 0 0 1 0 ...
 $ sci_fi          : int [1:201] 0 0 0 1 0 0 1 0 0 1 ...
 $ crime           : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ thriller        : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ animation       : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ family          : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ mystery         : int [1:201] 0 0 0 0 1 0 0 0 0 0 ...
 $ biography       : int [1:201] 0 0 0 0 0 0 0 1 0 0 ...
 $ music           : int [1:201] 0 0 0 0 0 0 0 1 0 0 ...
 $ horror          : int [1:201] 0 0 0 0 0 0 1 0 0 0 ...
 $ musical         : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ war             : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ history         : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ sport           : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ western         : int [1:201] 0 0 0 0 0 0 0 0 0 0 ...
 $ film            : chr [1:201] "3 Idiots" "8 1/2" "10 Things I Hate About You" "2001: A Space Odyssey" ...
 $ lang_eng        : Factor w/ 2 levels "English","Non-English": 2 2 1 1 2 1 1 1 1 1 ...

First Cut: 18 predictors

Quantities

  • imdb_pct10, rt_audiencescore, box_off_mult, budget, metascore

Counts or Multi-categorical and ordinal

  • ebert, imdb_oscars, mentions, bw_rating, bacon_1

Binary Categorical

  • dr_love, gen_1, lang_eng, and
  • top five genres: drama, comedy, adventure, action, romance

How many predictors can we use?

If we have a linear regression model with 201 observations (at most, some variables are missing, remember), then how many predictors can we realistically fit?

A useful starting strategy when you’re not doing variable selection is that you need at least 15 observations for each coefficient you will estimate, including the intercept.

See https://hbiostat.org/bbr/ Frank Harrell, Biostatistics for Biomedical Research

How Many Predictors (at maximum)?

  • The model will run, so long as you have more observations than cases, but that’s not a good standard to use.
  • Bigger samples are better, but sample size is often determined by pragmatic considerations.

A more modern (and complex) answer is found in this Riley et al (2020) BMJ article.

201 / 15 = 13.4 coefficients

That’s really a maximum, though. We’d like to avoid fitting more than perhaps 10 coefficients (including the intercept)…

  • Each quantitative predictor requires one coefficient
  • Each binary predictor also requires one coefficient
  • When treated as multi-categorical, a factor with k levels requires k-1 coefficients

So what shall we do?

Second Cut: 9 predictors

Variable Type Description
imdb_pct10 Quant % of 10-star public ratings in IMDB
rt_audiencescore Quant Rotten Tomatoes Audience Score (% Fresh)
box_off_mult Quant World Wide Gross Revenue (as multiple of budget)
metascore Quant Metascore (0-100 scale) from critic reviews
imdb_oscars Quant # of Oscar (Academy Award) wins
bw_rating Quant Bechdel-Wallace Test Criteria Met (0-3)
lang_eng Binary Is primary language English? (1 = Yes, 0 = No)
drama Binary Is drama listed in imdb_categories? (1 = Yes, 0 = No)
comedy Binary Is comedy listed in imdb_categories? (1 = Yes, 0 = No)
  • 10 coefficients x 15 = 150 observations needed, at minimum. We have 201.

Create movies_2

movies_2 <- movies |>
  select(film_id, fc_pctwins, imdb_pct10, rt_audiencescore,
         box_off_mult, metascore, imdb_oscars, bw_rating,
         lang_eng, drama, comedy, film)

dim(movies_2)
[1] 201  12
  • Two identifiers (film and film_id)
  • Our outcome fc_pctwins
  • Our nine predictors (from previous slide)

How much missingness do we have?

miss_case_table(movies_2) |> gt() 
n_miss_in_case n_cases pct_cases
0 175 87.064677
1 16 7.960199
2 6 2.985075
3 4 1.990050
miss_var_summary(movies_2) |> filter(n_miss > 0) |> gt()
variable n_miss pct_miss
box_off_mult 20 9.9502488
metascore 10 4.9751244
bw_rating 8 3.9800995
imdb_oscars 2 0.9950249

Can we assume MCAR and just do a complete case analysis?

mcar_test(movies_2) |> gt() 
statistic df p.value missing.patterns
114.8076 83 0.01192347 9
  • Note that we’re hurt here by reducing our data set to 12 variables, but that’s the way the cookie crumbles.
mcar_test(movies) |> gt()
statistic df p.value missing.patterns
434.3851 414 0.2356999 13

No data-driven variable selection

Why not?

  • We can’t afford it. Too small a ratio of sample size to predictors.

What about validation?

  • We don’t want to split our sample, certainly.
  • There are other methods (coming in 432) which would help, like k-fold cross validation and bootstrap validation.

Can we do multiple imputation from the start?

Let’s create 25 imputations

To come.

Look at Imputation 6

  • Correlation matrix (from R Graphics Cookbook)
  • ggmice demonstration of relationship between fc_pctwins and box_office_mult
  • Scatterplot matrix

Build our model in Imputation 6

To come.

tidied Coefficients in Imputation 6

To come.

glance summary in Imputation 6

To come.

Residual Plots: Imputation 6

To come.

Pooled Coefficient Estimates (all 25 imputations)

To come.

\(R^2\) and adjusted \(R^2\) after Pooling

To come.

Conclusions?

To come.

Session Information

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4.5          askpass_1.2.0        backports_1.4.1     
  base64enc_0.1.3      bigD_0.2.0           bit_4.0.5           
  bit64_4.0.5          bitops_1.0.7         blob_1.2.4          
  boot_1.3-28.1        brio_1.1.3           broom_1.0.5         
  bslib_0.6.1          cachem_1.0.8         callr_3.7.3         
  car_3.1.2            carData_3.0.5        cellranger_1.1.0    
  class_7.3.22         cli_3.6.1            clipr_0.8.0         
  codetools_0.2-19     colorspace_2.1-0     commonmark_1.9.0    
  compiler_4.3.2       conflicted_1.2.0     corrplot_0.92       
  cpp11_0.4.7          crayon_1.5.2         curl_5.1.0          
  data.table_1.14.8    DBI_1.1.3            dbplyr_2.4.0        
  DEoptimR_1.1.3       desc_1.4.2           diffobj_0.3.5       
  digest_0.6.33        doRNG_1.8.6          dplyr_1.1.4         
  dtplyr_1.3.1         e1071_1.7.13         ellipsis_0.3.2      
  evaluate_0.23        fansi_1.0.5          farver_2.1.1        
  fastmap_1.1.1        fontawesome_0.5.2    forcats_1.0.0       
  foreach_1.5.2        fs_1.6.3             gargle_1.5.2        
  generics_0.1.3       ggformula_0.12.0     ggmice_0.1.0        
  ggplot2_3.4.4        ggridges_0.5.4       glmnet_4.1-8        
  glue_1.6.2           googledrive_2.1.1    googlesheets4_1.1.1 
  gower_1.0.1          graphics_4.3.2       grDevices_4.3.2     
  grid_4.3.2           gridExtra_2.3        gt_0.10.0           
  gtable_0.3.4         gtExtras_0.5.0       haven_2.5.4         
  here_1.0.1           highr_0.10           hms_1.1.3           
  htmltools_0.5.7      htmlwidgets_1.6.3    httr_1.4.7          
  ids_1.0.1            isoband_0.2.7        iterators_1.0.14    
  itertools_0.1.3      janitor_2.2.0        jomo_2.7-6          
  jquerylib_0.1.4      jsonlite_1.8.7       juicyjuice_0.1.0    
  knitr_1.45           labeling_0.4.3       labelled_2.12.0     
  laeken_0.5.2         lattice_0.21-9       lifecycle_1.0.4     
  lme4_1.1-35.1        lmtest_0.9.40        lubridate_1.9.3     
  magrittr_2.0.3       markdown_1.11        MASS_7.3-60         
  Matrix_1.6-4         MatrixModels_0.5.3   memoise_2.0.1       
  methods_4.3.2        mgcv_1.9.0           mice_3.16.0         
  mime_0.12            minqa_1.2.6          missForest_1.5      
  mitml_0.4-5          modelr_0.1.11        mosaic_1.9.0        
  mosaicCore_0.9.4.0   mosaicData_0.20.4    munsell_0.5.0       
  naniar_1.0.0         nlme_3.1-163         nloptr_2.0.3        
  nnet_7.3-19          norm_1.0-11.1        numDeriv_2016.8.1.1 
  openssl_2.1.1        ordinal_2022.11.16   paletteer_1.5.0     
  pan_1.9              parallel_4.3.2       patchwork_1.1.3     
  pbkrtest_0.5.2       pillar_1.9.0         pkgbuild_1.4.2      
  pkgconfig_2.0.3      pkgload_1.3.3        plyr_1.8.9          
  praise_1.0.0         prettyunits_1.2.0    prismatic_1.1.1     
  processx_3.8.2       progress_1.2.2       proxy_0.4.27        
  ps_1.7.5             purrr_1.0.2          quantreg_5.97       
  R6_2.5.1             ragg_1.2.6           randomForest_4.7.1.1
  ranger_0.16.0        rappdirs_0.3.3       RColorBrewer_1.1.3  
  Rcpp_1.0.11          RcppEigen_0.3.3.9.4  reactable_0.4.4     
  reactR_0.5.0         readr_2.1.4          readxl_1.4.3        
  rematch_2.0.0        rematch2_2.1.2       reprex_2.0.2        
  rlang_1.1.2          rmarkdown_2.25       rngtools_1.5.2      
  robustbase_0.99.1    rpart_4.1.21         rprojroot_2.0.4     
  rstudioapi_0.15.0    rvest_1.0.3          sass_0.4.7          
  scales_1.3.0         selectr_0.4.2        shape_1.4.6         
  simputation_0.2.8    snakecase_0.11.1     sp_2.1.2            
  SparseM_1.81         splines_4.3.2        stats_4.3.2         
  stringi_1.8.2        stringr_1.5.1        survival_3.5-7      
  sys_3.4.2            systemfonts_1.0.5    testthat_3.2.1      
  textshaping_0.3.7    tibble_3.2.1         tidyr_1.3.0         
  tidyselect_1.2.0     tidyverse_2.0.0      timechange_0.2.0    
  tinytex_0.49         tools_4.3.2          tzdb_0.4.0          
  ucminf_1.2.0         UpSetR_1.4.0         utf8_1.2.4          
  utils_4.3.2          uuid_1.1.1           V8_4.4.0            
  vcd_1.4.11           vctrs_0.6.5          VIM_6.2.2           
  viridis_0.6.4        viridisLite_0.4.2    visdat_0.6.0        
  vroom_1.6.4          waldo_0.5.2          withr_2.5.2         
  xfun_0.41            xml2_1.3.5           yaml_2.3.7          
  zoo_1.8.12